With climate change likely to remain a prevalent concern among Americans for the foreseeable future, there has been considerable debate about how to approach the issue. Some consider the failure to act now a proverbial nail in the coffin, while others are optimistic that the pace of technological change will eventually sort the problem out. William Nordhaus, often referred to as the leading mind in the economics of climate change, treats the issue as a matter of careful cost-benefit analysis. In his book The Climate Casino: Risk, Uncertainty, and Economics for a Warming World, he describes some of the nuanced and unresolved issues in the economics of climate action and even questions some of the methods widely considered the best course of action.
Photovoltaic panels, or solar panels, are one of these methods: although they have proven efficient in some scenarios, they are not a feasible first choice for clean electricity production on a massive scale. However, solar panels can supplement net electricity consumption, and many solar advocates claim they can decrease the price of electricity, not just for your own home or place of business but for your neighbors as well. This has been the primary incentive for integrating solar panels into communities. Here, we'd like to see whether we can predict the use of solar panels in a given region, possibly gaining insight into the characteristics associated with solar adoption. Although many incentives are in place to increase public support and adoption of solar panels, the adoption rate of residential, community-wide solar integration appears to have stagnated. We'll attempt to predict the adoption of solar panels across the US using various machine learning algorithms.
Stanford's DeepSolar project used satellite imaging and computer vision to identify solar panels within the continental United States. Rather than undertaking a project to image the entire U.S., is there a way to predict the adoption of photovoltaic panels across the U.S. using demographic, economic, and geospatial data? The data can be found on the DeepSolar website: http://web.stanford.edu/group/deepsolar/home.
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow
def plot_ann_auc(model, X_test, y_test, model_name='model_name'):
    # Baseline: a no-skill classifier assigns the same score to every sample
    n_probs = [0 for _ in range(len(y_test))]
    # Keras removed predict_proba; with a sigmoid output layer, predict()
    # already returns positive-class probabilities of shape (n, 1)
    x_probs = model.predict(X_test).ravel()
    n_auc = roc_auc_score(y_test, n_probs)
    x_auc = roc_auc_score(y_test, x_probs)
    print('No Skill Rate AUC: ', n_auc)
    print('Learned AUC: ', x_auc)
    n_fpr, n_tpr, _ = roc_curve(y_test, n_probs)
    l_fpr, l_tpr, _ = roc_curve(y_test, x_probs)
    plt.figure(figsize=(16, 8))
    plt.plot(n_fpr, n_tpr, linestyle='--', label='No skill rate')
    plt.plot(l_fpr, l_tpr, marker='.', label=model_name)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.legend()
    plt.show()
def plot_prob_auc(model, X_test, y_test, model_name='model_name'):
    # Baseline: a no-skill classifier assigns the same score to every sample
    n_probs = [0 for _ in range(len(y_test))]
    # Column 1 of predict_proba holds the positive-class probabilities
    x_probs = model.predict_proba(X_test)[:, 1]
    n_auc = roc_auc_score(y_test, n_probs)
    x_auc = roc_auc_score(y_test, x_probs)
    print('No Skill Rate AUC: ', n_auc)
    print('Learned AUC: ', x_auc)
    n_fpr, n_tpr, _ = roc_curve(y_test, n_probs)
    l_fpr, l_tpr, _ = roc_curve(y_test, x_probs)
    plt.figure(figsize=(16, 8))
    plt.plot(n_fpr, n_tpr, linestyle='--', label='No skill rate')
    plt.plot(l_fpr, l_tpr, marker='.', label=model_name)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.legend()
    plt.show()
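As a quick sanity check on the no-skill baseline used in both helpers: a constant score for every sample yields an AUC of exactly 0.5 regardless of class balance, which is why it makes a useful reference curve. A small self-contained illustration (with made-up labels, not the DeepSolar data):

```python
from sklearn.metrics import roc_auc_score

# Imbalanced labels, similar in spirit to the adoption target (~77% positive)
y_true = [1] * 77 + [0] * 23

# A constant "no skill" score for every sample
n_probs = [0 for _ in y_true]
print(roc_auc_score(y_true, n_probs))  # 0.5
```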
data = pd.read_csv('deepsolar_tract.csv', delimiter=',',encoding='latin-1')
use_cols = ['tile_count','average_household_income','county','education_bachelor','education_college',
'education_doctoral','education_high_school_graduate','education_less_than_high_school',
'education_master','education_population','education_professional_school','land_area',
'per_capita_income','population','population_density','poverty_family_below_poverty_level',
'poverty_family_count','state','total_area','unemployed','water_area','employ_rate',
'poverty_family_below_poverty_level_rate','median_household_income','electricity_price_residential',
'electricity_price_commercial','electricity_price_industrial','electricity_price_transportation',
'electricity_price_overall','electricity_consume_residential','electricity_consume_commercial',
'electricity_consume_industrial','electricity_consume_total','household_count','average_household_size',
'housing_unit_count','housing_unit_occupied_count','housing_unit_median_value',
'housing_unit_median_gross_rent','heating_design_temperature',
'cooling_design_temperature','earth_temperature_amplitude','frost_days','air_temperature',
'relative_humidity','daily_solar_radiation','atmospheric_pressure','wind_speed','earth_temperature',
'heating_degree_days','cooling_degree_days','age_18_24_rate','age_25_34_rate','age_more_than_85_rate',
'age_75_84_rate','age_35_44_rate','age_45_54_rate','age_65_74_rate','age_55_64_rate','age_10_14_rate',
'age_15_17_rate','age_5_9_rate','household_type_family_rate','dropout_16_19_inschool_rate',
'occupation_construction_rate','occupation_public_rate','occupation_information_rate',
'occupation_finance_rate','occupation_education_rate','occupation_administrative_rate',
'occupation_manufacturing_rate','occupation_wholesale_rate','occupation_retail_rate',
'occupation_transportation_rate','occupation_arts_rate','occupation_agriculture_rate',
'occupancy_vacant_rate','occupancy_owner_rate','mortgage_with_rate','transportation_home_rate',
'transportation_car_alone_rate','transportation_walk_rate','transportation_carpool_rate',
'transportation_motorcycle_rate','transportation_bicycle_rate','transportation_public_rate',
'travel_time_less_than_10_rate','travel_time_10_19_rate','travel_time_20_29_rate',
'travel_time_30_39_rate','travel_time_40_59_rate','travel_time_60_89_rate','health_insurance_public_rate',
'health_insurance_none_rate','age_median','travel_time_average','voting_2016_dem_percentage',
'voting_2016_gop_percentage','voting_2012_dem_percentage','voting_2012_gop_percentage',
'number_of_years_of_education','diversity','incentive_count_residential',
'incentive_count_nonresidential','incentive_residential_state_level','incentive_nonresidential_state_level',
'net_metering','feedin_tariff','cooperate_tax','property_tax','sales_tax','rebate','avg_electricity_retail_rate']
The following features were dropped due to redundancy, direct inference of the predicted variable, or because they were not necessary for domain understanding.
dropped = [col for col in data.columns if col not in use_cols]
np.array(dropped).flatten()
df = data[use_cols].copy()  # copy to avoid SettingWithCopyWarning on later assignments
df.tile_count.describe()
df.loc[df.tile_count == 0,'adoption'] = 0.0
df.loc[(df.tile_count>0),'adoption'] = 1
df.adoption.value_counts()
df.state = df.state.str.upper()
state_df = df.groupby('state').sum()
fig = px.choropleth(locations=state_df.index, locationmode="USA-states", color=state_df.tile_count, scope="usa")
fig.show()
corr = df.drop(['tile_count','county','state'],axis=1).corr()
plt.figure(figsize=(30,30))
fig = sns.heatmap(data=corr,cmap='binary')
from sklearn.preprocessing import MinMaxScaler,LabelEncoder
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score, f1_score, precision_score, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split, ShuffleSplit
encoder = LabelEncoder()
df['county_code'] = encoder.fit_transform(df.county)
df['state_code'] = encoder.fit_transform(df.state)
# The most frequent value in this column is a non-numeric placeholder; zero it
# out so the column can be cast to float below
df.loc[df.electricity_price_transportation==df.electricity_price_transportation.value_counts().index[0],'electricity_price_transportation'] = 0
df.electricity_price_transportation = df.electricity_price_transportation.astype(float)
[col for col in df.columns if df[col].dtype=='O']
df.shape
state_meds = df.groupby('state').median()
for col in state_meds.columns:
    # Share of states whose group-level median is itself missing
    print(col, ' : ', round((state_meds[col].isna().sum() / len(state_meds)) * 100, 2), '%')
county_meds = df.groupby('county').median()
for col in county_meds.columns:
    # Share of counties whose group-level median is itself missing
    print(col, ' : ', round((county_meds[col].isna().sum() / len(county_meds)) * 100, 2), '%')
county_impute = ['average_household_income','land_area','per_capita_income','population_density',
'total_area','water_area','employ_rate','poverty_family_below_poverty_level_rate',
'median_household_income','average_household_size','housing_unit_median_value',
'housing_unit_median_gross_rent','age_18_24_rate','age_25_34_rate','age_more_than_85_rate',
'age_75_84_rate','age_35_44_rate','age_45_54_rate','age_65_74_rate','age_55_64_rate','age_10_14_rate',
'age_15_17_rate','age_5_9_rate','household_type_family_rate','dropout_16_19_inschool_rate',
'occupation_construction_rate','occupation_public_rate','occupation_information_rate','occupation_finance_rate',
'occupation_education_rate','occupation_administrative_rate','occupation_manufacturing_rate',
'occupation_wholesale_rate','occupation_retail_rate','occupation_transportation_rate','occupation_arts_rate',
'occupation_agriculture_rate','occupancy_vacant_rate','occupancy_owner_rate','mortgage_with_rate',
'transportation_home_rate','transportation_car_alone_rate','transportation_walk_rate','transportation_carpool_rate',
'transportation_motorcycle_rate','transportation_bicycle_rate','transportation_public_rate',
'travel_time_less_than_10_rate','travel_time_10_19_rate','travel_time_20_29_rate','travel_time_30_39_rate',
'travel_time_40_59_rate','travel_time_60_89_rate','health_insurance_public_rate','health_insurance_none_rate',
'age_median','travel_time_average','voting_2012_dem_percentage','voting_2012_gop_percentage',
'number_of_years_of_education','diversity']
state_impute = ['heating_design_temperature','cooling_design_temperature',
'earth_temperature_amplitude','frost_days','air_temperature','relative_humidity',
'daily_solar_radiation','atmospheric_pressure','wind_speed','earth_temperature',
'heating_degree_days','cooling_degree_days','voting_2012_dem_percentage','voting_2012_gop_percentage']
for col in county_impute:
    df[col] = df.groupby('county')[col].transform(lambda x: x.fillna(x.median()))
for col in state_impute:
    df[col] = df.groupby('state')[col].transform(lambda x: x.fillna(x.median()))
df.loc[df.voting_2012_dem_percentage.isnull(),'voting_2012_dem_percentage'] = df.voting_2012_dem_percentage.median()
df.loc[df.voting_2012_gop_percentage.isnull(),'voting_2012_gop_percentage'] = df.voting_2012_gop_percentage.median()
for col in df.columns:
    print(col, ' : ', round((df[col].isna().sum() / len(df)) * 100, 2), '%')
df = df.dropna()
X = df.drop(['tile_count','adoption','county','state'],axis=1)
y = df['adoption']
X = X.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42,test_size=0.3)
scaler = MinMaxScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(X),columns=X.columns)
# Assign by position with .values: the scaler output has a fresh RangeIndex,
# while df's index has gaps from dropna, so index-aligned assignment would
# misalign rows and introduce NaNs
scaled_data['adoption'] = df['adoption'].values
scaled_data = scaled_data.dropna()
X_scaled = scaled_data.drop('adoption',axis=1)
y_scaled = scaled_data['adoption']
X_trains, X_tests, y_trains, y_tests = train_test_split(X_scaled, y_scaled, random_state=42, test_size=0.3)
Although gradient boosting would likely perform better, the algorithm is extremely computationally expensive, so for the sake of this analysis we'll use a Random Forest. A great property of Random Forests is their ability to arrive at interpretable conclusions: by viewing which features ultimately lead to the highest information gain across the estimators, we can treat feature importance as a proxy for the real-life predictive power of each attribute associated with the classification of solar adoption.
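The cost trade-off can be sanity-checked on toy data. A rough, self-contained sketch comparing the fit times of scikit-learn's sequential `GradientBoostingClassifier` against a `RandomForestClassifier` (an illustration on synthetic data, not a benchmark, and not the exact configuration used in this analysis):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in data so the comparison runs quickly
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

timings = {}
for name, clf in [('gradient boosting', GradientBoostingClassifier(random_state=42)),
                  ('random forest', RandomForestClassifier(random_state=42))]:
    start = time.perf_counter()
    clf.fit(X, y)  # boosting fits trees sequentially; the forest can parallelize
    timings[name] = time.perf_counter() - start
    print(name, ':', round(timings[name], 2), 's')
```

On the full DeepSolar table (~72k rows, 100+ columns) this gap widens considerably, especially inside a cross-validated grid search.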
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
print(classification_report(y_test,rf_pred))
rf_auc = GridSearchCV(rf,param_grid={'n_estimators':np.arange(80,240,20),
'criterion':['gini','entropy'],
'max_depth':[5,10,25,50,100],
'min_samples_split':[10,50,250,500]},
scoring='roc_auc',
cv=5,
n_jobs=-1)
rf_auc.fit(X_train,y_train)
I've elected to tune the model with a cross-validated grid search that yields the highest ROC Area Under the Curve (AUC). Since more of the United States has adopted solar than not, either the F1-score or the AUC is the better choice, as it avoids the class-imbalance-driven misguidance of relying on accuracy as a measure of model performance. Roughly 77% of counties (FIPS identification regions) have adopted PV as part of their energy infrastructure at some scale. We'd like to avoid developing a model that devolves into a simple majority-vote classifier, so accuracy is not an ideal indicator for our use case.
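Once the search finishes, the winning hyperparameters can be read off the fitted `GridSearchCV` object via `best_params_` and `best_score_`. A minimal, self-contained sketch (using a toy dataset and a much smaller grid than the one above, so it runs in seconds):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for X_train / y_train, imbalanced like the adoption label
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   weights=[0.25, 0.75], random_state=42)

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid={'n_estimators': [50, 100],
                                  'max_depth': [5, 10]},
                      scoring='roc_auc',
                      cv=3,
                      n_jobs=-1)
search.fit(X_toy, y_toy)

# best_params_ holds the winning combination; best_score_ the mean CV AUC
print(search.best_params_)
print(round(search.best_score_, 3))
```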
rf_clf = RandomForestClassifier(n_estimators=500,
max_depth=25,
criterion='entropy',
min_samples_split=10,
random_state=42,
n_jobs=-1)
rf_clf.fit(X_train,y_train)
The tuned Random Forest performs well in predicting the adoption of solar panel generation across the U.S. Depending on the use-case scenario for this data, we may want to optimize for different metrics. For instance, this model performs better on the majority class (counties with solar) and not as well on the minority class. Considering the metrics seen below, rather than further tuning we may want to resort to resampling methods or dimensionality reduction in the future.
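One lighter-weight alternative to the resampling methods mentioned above is to reweight the classes directly. A minimal sketch on synthetic imbalanced data using scikit-learn's `class_weight='balanced'` option (not used in the analysis above; shown only as a possible remedy):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same imbalance as the adoption label (~77% positive)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.23, 0.77], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, test_size=0.3)

# class_weight='balanced' upweights each class inversely to its frequency,
# pushing the model to pay more attention to the minority class
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_tr, y_tr)

# Recall on the minority class (label 0) is exactly what plain accuracy hides
print(round(recall_score(y_te, clf.predict(X_te), pos_label=0), 3))
```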
rfclf_pred = rf_clf.predict(X_test)
print(classification_report(y_test,rfclf_pred))
plot_prob_auc(rf_clf,X_test,y_test,model_name='Random Forest')
Below we construct a perceptron using the sigmoid function at the output layer and ReLU at the input layer, trained with Nadam, which combines Nesterov momentum with the Adam gradient optimization algorithm. We print the accuracy and AUC after each epoch, apply an early-stopping criterion, and use binary cross-entropy as the loss function for the backpropagation weight and bias updates.
from tensorflow.keras import models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import initializers, optimizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import AUC
perc = Sequential()
perc.add(Dense(112,
input_dim=112,
activation='relu',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
perc.add(Dense(1,
input_dim=112,
activation='sigmoid',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
earlystop_callback = EarlyStopping(monitor='accuracy',
min_delta=0.0001,
patience=7)
auc = AUC()
perc.compile(optimizer='nadam',loss='binary_crossentropy',metrics=['accuracy',auc])
perc.fit(X_trains.values,y_trains.values,epochs=100,callbacks=[earlystop_callback])
plot_ann_auc(perc,X_tests.values,y_tests.values,model_name='Perceptron')
# predict_classes was removed from Keras; threshold the sigmoid output instead
perc_pred = (perc.predict(X_tests.values) > 0.5).astype(int).flatten()
print(classification_report(y_tests,perc_pred))
The perceptron network appears to overfit the training dataset, returning an AUC and accuracy higher than on our test data. Though performance on the minority class is better on the test data, the AUC is still lower than on the training data and lower than the Random Forest's. To counter this overfitting, we can add more layers, giving the network more weights and biases to update, and include dropout layers to help the model break up co-adapted nodes.
ann = Sequential()
ann.add(Dense(112,
input_dim=112,
activation='relu',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
ann.add(Dense(224,
input_dim=112,
activation='relu',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
ann.add(Dropout(0.5,seed=42))
ann.add(Dense(112,
input_dim=112,
activation='relu',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
ann.add(Dense(224,
input_dim=112,
activation='relu',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
ann.add(Dropout(0.5,seed=42))
ann.add(Dense(112,
input_dim=112,
activation='relu',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
ann.add(Dense(1,
input_dim=112,
activation='sigmoid',
kernel_initializer=initializers.glorot_uniform(seed=42),
bias_initializer='zeros'))
ann.compile(optimizer='nadam',loss='binary_crossentropy',metrics=['accuracy',auc])
ann.fit(X_trains.values,y_trains.values,epochs=100,callbacks=[earlystop_callback])
Including dropout layers yields a slight increase in AUC over the perceptron model. However, we've improved overall performance at the cost of a small amount of minority-class performance. It's important to remember that each model has slight advantages and disadvantages depending on the use case; they perform virtually the same even with dropout regularization applied.
plot_ann_auc(ann,X_tests.values,y_tests.values,model_name='ANN W/ Dropout')
# predict_classes was removed from Keras; threshold the sigmoid output instead
ann_pred = (ann.predict(X_tests.values) > 0.5).astype(int).flatten()
print(classification_report(y_tests,ann_pred))
print("Perceptron:\n",classification_report(y_tests,perc_pred),"ANN W/ Dropout Regularization:\n",classification_report(y_tests,ann_pred))
The Random Forest performed best overall on the dataset, though the neural networks performed well too. Below is a comparison of each model's performance.
print("Random Forest:\n",classification_report(y_test,rf_pred),"\nPerceptron:\n",classification_report(y_tests,perc_pred),"\nANN W/ Dropout Regularization:\n",classification_report(y_tests,ann_pred))
importances = rf_clf.feature_importances_
indices = np.argsort(importances)
features = X_trains.columns
As mentioned earlier, a nice bonus of the Random Forest is that we can see which features yield the highest information gain and are therefore most important. The feature importances are plotted in descending order below; surprisingly, the quality of grid infrastructure and the prevalence of economic incentives appear less important for predicting solar adoption. The most important features suggest a mix revolving primarily around regional education, income, and weather.
plt.figure(figsize=(20,20))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='g', align='center')
plt.yticks(range(len(indices)), features[indices],fontsize=12)
plt.xlabel('Relative Importance')
plt.show()